Red Wine Quality Exploration by Hongyan Wang
Univariate Plots Section
## [1] 1599 13
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## [1] "3" "4" "5" "6" "7" "8"
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 3: 10
## 1st Qu.: 9.50 4: 53
## Median :10.20 5:681
## Mean :10.42 6:638
## 3rd Qu.:11.10 7:199
## Max. :14.90 8: 18
Most red wines have quality “5” or “6”. The middle half of the red wines have a pH between 3.21 and 3.40 and chlorides between 0.07 and 0.09 (the first and third quartiles).
More than 90% of red wines have quality “5”, “6” or “7”.
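As a quick check of the two shares quoted above, the counts printed in the quality summary can be turned into proportions. This is a minimal sketch; `quality_counts` is a name introduced here for illustration, and its values are copied from the summary output.

```r
# Counts per quality level, copied from the summary() output above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

n <- sum(quality_counts)  # total number of wines

# Share of wines rated 5 or 6, and share rated 5, 6 or 7.
share_5_6   <- sum(quality_counts[c("5", "6")]) / n
share_5_6_7 <- sum(quality_counts[c("5", "6", "7")]) / n

print(round(100 * c(share_5_6 = share_5_6, share_5_6_7 = share_5_6_7), 1))
```

Roughly 82% of the wines are rated 5 or 6, and roughly 95% are rated 5, 6 or 7, which is consistent with the statements above.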
Histogram of chlorides excluding the lowest 5% and the highest 5% of values.
I wonder whether chlorides influence the quality of red wines.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
##
## FALSE TRUE
## 1595 4
Chlorides range from 0.012 to 0.611, but the middle half of the values lie between 0.07 and 0.09. Nearly all values are below 0.45, and there are several outliers.
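The FALSE/TRUE table and the trimmed histogram described above could be produced along these lines. This is only a sketch: the report does not show its code, so the 0.45 cutoff is inferred from the sentence above, the vector holds just the first ten chloride values printed by `str()`, and all variable names are assumptions.

```r
# First ten chloride values, copied from the str() output (not the full data).
chlorides <- c(0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, 0.065, 0.073, 0.071)

# Count how many values exceed an outlier cutoff (0.45 is an assumed cutoff).
above_cutoff <- table(chlorides > 0.45)
print(above_cutoff)

# Keep only values between the 5% and 95% quantiles.
bounds  <- quantile(chlorides, probs = c(0.05, 0.95))
trimmed <- chlorides[chlorides >= bounds[["5%"]] & chlorides <= bounds[["95%"]]]

# Compute the histogram bins without drawing; use plot = TRUE to draw it.
h <- hist(trimmed, plot = FALSE)
print(h$counts)
```

On the full data the same three steps would be applied to the whole chlorides column; with only ten values none exceeds the cutoff, so the table shows a single FALSE entry.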

Histogram of chlorides excluding the lowest 5% and the highest 5% of values.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
##
## FALSE TRUE
## 1591 8
Sulphates range from 0.33 to 2.00, but the middle half of the values lie between 0.55 and 0.73. Nearly all values are below 1.5, and there are some outliers.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
I wonder if the red wine quality has anything to do with the alcohol. Alcohol may have a big influence on wine.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
##
## FALSE TRUE
## 1590 9
Most wines have total.sulfur.dioxide below 150; there are some outliers.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
##
## FALSE TRUE
## 1588 11
The middle half of the wines have residual.sugar between 1.9 and 2.6, and nearly all are below 10.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
##
## FALSE TRUE
## 1595 4
The middle half of the wines have volatile.acidity between 0.39 and 0.64; there are several outliers.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
##
## FALSE TRUE
## 1467 132
Some wines contain no citric.acid.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
##
## FALSE TRUE
## 1595 4
Most wines have free.sulfur.dioxide below 60. There are several outliers.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
pH and density look approximately normally distributed.
Univariate Analysis
What is the structure of your dataset?
There are 1599 wines in the dataset with 13 features (X, fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol and quality). X is simply the row index. The variable quality has levels “3”, “4”, “5”, “6”, “7”, “8”.
Most wines have quality “5”, “6” or “7”. Most wines have free.sulfur.dioxide below 60 and total.sulfur.dioxide below 150. The middle half of the wines have volatile.acidity between 0.39 and 0.64, residual.sugar between 1.9 and 2.6, sulphates between 0.55 and 0.73, and chlorides between 0.07 and 0.09.
What is/are the main feature(s) of interest in your dataset?
The main feature I’m interested in is “quality”. I wonder which chemical properties influence the quality of red wine, so I will investigate the relationships between quality and the other variables.
What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
I think alcohol may have a big influence on quality. I will also investigate sulphates, chlorides, fixed.acidity, free.sulfur.dioxide, total.sulfur.dioxide, residual.sugar, citric.acid and volatile.acidity. pH and density look approximately normally distributed, and I think they may have little influence on quality.
Did you create any new variables from existing variables in the dataset?
No, since I want to investigate whether the existing chemical properties influence quality.
Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
Density and pH look approximately normally distributed, so I guess they may have little influence on quality. The data are already tidy, so I did not change their form. total.sulfur.dioxide, volatile.acidity and chlorides may contain some outliers.
Bivariate Plots Section
I want to look at scatter plots of quality against the other variables, since scatter plots are one of the best ways to understand a bivariate relationship. Because quality is discrete, I use jittered plots to reduce overplotting.


##
## Pearson's product-moment correlation
##
## data: pf$sulphates and as.numeric(pf$quality)
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
It seems that quality increases as sulphates increase up to about 0.9. Beyond 0.9, quality decreases as sulphates increase.
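The numbers in the sulphates correlation test above can be re-derived from the correlation and the sample size alone, which helps when reading the other `cor.test` outputs in this section. The sketch below uses the standard formulas for the Pearson test (a t statistic with n - 2 degrees of freedom and a confidence interval based on the Fisher z transform); the variable names are introduced here for illustration only.

```r
# Values copied from the cor.test output for sulphates and quality.
r <- 0.2513971   # sample correlation
n <- 1599        # number of wines
df <- n - 2      # degrees of freedom of the t test

# t statistic for H0: true correlation is 0.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval via the Fisher z transform.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(round(t_stat, 2))
print(round(ci, 4))
print(p_value < 2.2e-16)
```

One detail worth keeping in mind: the tests are run on `as.numeric(pf$quality)`, and for a factor that returns the level codes 1 to 6 rather than the scores 3 to 8. A constant shift does not change a Pearson correlation, so the reported values are unaffected.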

##
## Pearson's product-moment correlation
##
## data: pf$chlorides and as.numeric(pf$quality)
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.17681041 -0.08039344
## sample estimates:
## cor
## -0.1289066
It’s hard to find a pattern for chlorides and quality.

##
## Pearson's product-moment correlation
##
## data: pf$alcohol and as.numeric(pf$quality)
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
Apart from a few points, we can see that the more alcohol a wine contains, the higher its quality.

##
## Pearson's product-moment correlation
##
## data: pf$free.sulfur.dioxide and as.numeric(pf$quality)
## t = -2.0269, df = 1597, p-value = 0.04283
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.099430290 -0.001638987
## sample estimates:
## cor
## -0.05065606
free.sulfur.dioxide doesn’t have much influence on wine quality.

##
## Pearson's product-moment correlation
##
## data: pf$fixed.acidity and as.numeric(pf$quality)
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07548957 0.17202667
## sample estimates:
## cor
## 0.1240516
The fixed.acidity doesn’t have much influence on quality.

##
## Pearson's product-moment correlation
##
## data: pf$volatile.acidity and as.numeric(pf$quality)
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
The more volatile.acidity the wines contain, the lower quality they have.

##
## Pearson's product-moment correlation
##
## data: pf$total.sulfur.dioxide and as.numeric(pf$quality)
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2320162 -0.1373252
## sample estimates:
## cor
## -0.1851003
total.sulfur.dioxide doesn’t have much influence on wine quality.

##
## Pearson's product-moment correlation
##
## data: pf$citric.acid and as.numeric(pf$quality)
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
citric.acid doesn’t have much influence on quality. 
##
## Pearson's product-moment correlation
##
## data: pf$residual.sugar and as.numeric(pf$quality)
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03531327 0.06271056
## sample estimates:
## cor
## 0.01373164
The residual.sugar doesn’t have much influence on quality

##
## Pearson's product-moment correlation
##
## data: pf$pH and as.numeric(pf$quality)
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.106451268 -0.008734972
## sample estimates:
## cor
## -0.05773139
As expected, pH doesn’t have much influence on quality.

##
## Pearson's product-moment correlation
##
## data: pf$density and as.numeric(pf$quality)
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2220365 -0.1269870
## sample estimates:
## cor
## -0.1749192
Density also doesn’t have much influence on quality.

It seems that free.sulfur.dioxide and total.sulfur.dioxide have a linear relationship.

Now I want to use box plots to explore the relationships between quality and alcohol, sulphates and volatile.acidity.
From this box plot, we can see that the wines whose qualities are high have high alcohol.

We can see that the wines whose qualities are high have high sulphates.
We can see that the wines whose qualities are high have low volatile.acidity.
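The comparison that the box plots make visually can also be written as a per-quality summary. The sketch below shows the mechanics on the first ten rows printed by `str()`; ten rows are far too few to show the trend itself, and the data frame name `wine_sample` is an assumption made for this example.

```r
# First ten rows of alcohol and quality, copied from the str() output.
wine_sample <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality = factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5))
)

# Median alcohol within each quality level (the centre line of each box).
alcohol_medians <- tapply(wine_sample$alcohol, wine_sample$quality, median)
print(alcohol_medians)

# The same grouping drives a base R box plot:
# boxplot(alcohol ~ quality, data = wine_sample)
```

Swapping `alcohol` for sulphates or volatile.acidity gives the summaries behind the other two box plots.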
Bivariate Analysis
Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
I find that the features chlorides, fixed.acidity, residual.sugar, pH, free.sulfur.dioxide, citric.acid, total.sulfur.dioxide and density don’t have much influence on the quality of wines.
The quality of red wines is related to alcohol, sulphates and volatile.acidity.
The more alcohol the wines contain, the higher their quality. For sulphates, quality appears to increase with sulphates while sulphates < 0.9, and to decrease as sulphates increase beyond that.
Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
total.sulfur.dioxide and free.sulfur.dioxide are linearly related. The more free.sulfur.dioxide the wines contain, the more total.sulfur.dioxide they contain.
Also, it seems that the more citric.acid the wines contain, the less volatile.acidity they contain.
What was the strongest relationship you found?
The strongest relationship is the positive linear correlation between quality and alcohol (r ≈ 0.48). Quality and sulphates are also correlated (r ≈ 0.25).
Multivariate Plots Section
Look at the summary for variable sulphates first.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Then I break sulphates into two buckets (0.33,0.7] and (0.7,2]
I plot the scatter plot for alcohol and quality, colored by sulphates.

Look at the summary of alcohol
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
Break alcohol into 3 buckets: (8.4,10.2], (10.2,10.42] and (10.42,14.90].
I plot the relationship between sulphates and quality, colored by alcohol
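The bucketing described above can be done with `cut()`. The sketch below is an illustration rather than the report's own code: it uses the first ten sulphates and alcohol values from the `str()` output, and the variable names are assumptions. One point it highlights is that `cut()` builds intervals that are open on the left by default, so a wine sitting exactly at the minimum (0.33 for sulphates, 8.4 for alcohol) would get `NA` unless `include.lowest = TRUE` is set.

```r
# First ten values of each variable, copied from the str() output.
sulphates <- c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80)
alcohol   <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5)

# Two sulphates buckets, split at 0.7.
sulphates_bucket <- cut(sulphates, breaks = c(0.33, 0.7, 2),
                        include.lowest = TRUE)

# Three alcohol buckets, split at the median (10.2) and the mean (10.42).
alcohol_bucket <- cut(alcohol, breaks = c(8.4, 10.2, 10.42, 14.9),
                      include.lowest = TRUE)

print(table(sulphates_bucket))
print(table(alcohol_bucket))

# Without include.lowest, the minimum value falls outside every interval.
min_is_dropped <- is.na(cut(0.33, breaks = c(0.33, 0.7, 2)))
print(min_is_dropped)
```

The resulting factors can then be mapped to colour in the scatter plots. Note that the middle alcohol bucket is very narrow (10.2 to 10.42), so it will hold comparatively few wines.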

Plot the scatter plot for volatile.acidity and quality, colored by alcohol

Plot the scatter plot for volatile.acidity and quality, colored by sulphates

Multivariate Analysis
Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
I find that the more alcohol the wines contain, the higher their quality. When I plot the relationship between alcohol and wine quality colored by other variables such as sulphates, the pattern still holds.
Similarly, I find that the more volatile.acidity the wines contain, the lower their quality. When I plot the relationship between volatile.acidity and wine quality colored by other variables such as alcohol and sulphates, the pattern still holds.
Were there any interesting or surprising interactions between features?
The relationship between quality and alcohol is a little surprising to me. Before working with the data, I expected both low-alcohol and high-alcohol wines to include low-quality and high-quality examples in similar proportions. Instead I find that, for red wines, the more alcohol they contain, the higher their quality tends to be.
OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.
##
## Call:
## lm(formula = quality ~ alcohol + sulphates + volatile.acidity,
## data = pf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7186 -0.3820 -0.0641 0.4746 2.1807
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.61083 0.19569 3.121 0.00183 **
## alcohol 0.30922 0.01580 19.566 < 2e-16 ***
## sulphates 0.67903 0.10080 6.737 2.26e-11 ***
## volatile.acidity -1.22140 0.09701 -12.591 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6587 on 1595 degrees of freedom
## Multiple R-squared: 0.3359, Adjusted R-squared: 0.3346
## F-statistic: 268.9 on 3 and 1595 DF, p-value: < 2.2e-16
The R² for this linear model is about 0.34, so the model explains only about a third of the variance in quality: it is acceptable but not very good. We can still read off the coefficients: 0.31 for alcohol, 0.68 for sulphates and -1.22 for volatile.acidity.
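To make the coefficients concrete, the fitted equation can be evaluated by hand. The sketch below plugs in the column means from the summary tables; the function and variable names are assumptions made for this example. A least-squares fit with an intercept always predicts the mean response at the mean of the predictors, so the result also shows which scale the response was on. The prediction comes out near 3.64 rather than the mean quality score of about 5.64, which suggests the model was fitted on the factor level codes 1 to 6 (quality minus 2). That is an inference from the printed numbers, not something the report states.

```r
# Coefficients copied from the lm() summary above.
coefs <- c(intercept = 0.61083, alcohol = 0.30922,
           sulphates = 0.67903, volatile.acidity = -1.22140)

# Hypothetical helper: evaluate the fitted linear equation.
predict_quality <- function(alcohol, sulphates, volatile_acidity) {
  coefs[["intercept"]] +
    coefs[["alcohol"]] * alcohol +
    coefs[["sulphates"]] * sulphates +
    coefs[["volatile.acidity"]] * volatile_acidity
}

# Prediction at the column means reported by summary().
pred_at_means <- predict_quality(10.42, 0.6581, 0.5278)

# Mean quality score computed from the counts in the quality summary.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
mean_quality <- sum(as.numeric(names(quality_counts)) * quality_counts) /
  sum(quality_counts)

# as.numeric() on a factor returns level codes, not the original scores.
q <- factor(c(3, 5, 8), levels = 3:8)
codes  <- as.numeric(q)                # 1 3 6
scores <- as.numeric(as.character(q))  # 3 5 8

print(round(c(pred_at_means = pred_at_means, mean_quality = mean_quality), 3))
```

If that reading is right, predictions on the original 3 to 8 scale are obtained by adding 2 to the model output; the slopes are unaffected by the shift.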
Final Plots and Summary
Plot One

Description One
I choose this plot because I’m investigating the relationships between quality and the chemical properties, so it is useful to know the distribution of quality. The quality of red wines takes discrete values, namely “3”, “4”, “5”, “6”, “7” and “8”, and most red wines have quality “5” or “6”. These two facts make it harder to find which chemical properties influence the quality of red wines.
Plot Two

Description Two
I choose this plot because this box plot clearly shows that red wine quality is strongly related to alcohol. The more alcohol the wines contain, the higher their quality. For wines of quality 3, 4 or 5, the mean alcohol is about 10; for quality 6 it is about 10.5; for quality 7 it is about 11.5; and for quality 8 it is about 12.1. From the linear model, the coefficient for alcohol is 0.30922, which is consistent with the plot.
Plot Three

Description Three
I choose this plot because it clearly shows that red wine quality is negatively related to volatile.acidity. From the box plot of quality and volatile.acidity, we can see that the more volatile.acidity the wines contain, the lower their quality. From the linear model, the coefficient for volatile.acidity is -1.22140, which is consistent with the plot.
Reflection
The red wines dataset contains 1599 observations and 13 variables. I started by understanding each variable in the dataset. Since I want to investigate which chemical properties influence the quality of red wines, I looked at the quality variable first. I noticed that quality takes discrete values and that most red wines have quality “5” or “6”. This makes it difficult to find a clear relationship between quality and the other variables, which are continuous. Then I investigated the relationship between quality and each of the other variables one by one. For some variables, such as pH and density, I expected no influence on quality, and the results also show that they don’t have much influence. For other variables, such as fixed.acidity, I expected an influence on quality, but I did not find any clear relationship; the values of these variables don’t seem to influence the quality of wines. After investigating all the variables, I find that alcohol has a large influence on quality, which is a little surprising to me: the more alcohol the red wines contain, the higher their quality.
Main struggles:
- The quality of most wines is 5 or 6, which makes it very difficult to find a clear relationship between quality and the chemical properties.
- The scatter plots, which I consider a very good way to find bivariate relationships, look very messy. It’s hard to see patterns in them.
- When I explore multivariate relationships, I plot a scatter plot of two variables colored by a third variable. This makes the plots even harder to read.
Main successes:
- I investigate the variables one by one and find that some have almost no relationship with quality while others are clearly correlated with it.
- I find that some variables may be related to each other (like total.sulfur.dioxide and free.sulfur.dioxide), so if I build a linear model I will use only one of them.
- I use box plots to find some clear relationships between quality and variables such as alcohol. This tells me which variables to use in my model.
- The linear model I build is not perfect, but it gives me a good sense of the relationships between quality and the other variables.
Future work:
Since the quality of wines is discrete (“3”, “4”, “5”, “6”, “7”, “8”), I think it would be a good idea to use classification algorithms to explore which chemical properties influence quality. I could even use such classifiers to predict the quality of wines.